Our team was asked to carry out a descriptive study of the Seoul bike sharing service. To this end we use both descriptive and inferential statistics to find correlations, trends and patterns among the variables of the ‘SeoulBike’ dataset.
To what extent has the Seoul bike sharing service been a success since its start date?
How much of this evolution over time can be explained by the dataset’s variables?
First of all, we import the ‘SeoulBike’ dataset, a .csv file, using the read.csv() function. The dataset records the number of rented bikes in Seoul from December 2017 to November 2018.
The dataset consists of 8760 observations and 14 variables. Several variables are meteorological, among them the temperature in °C, the level of solar radiation and the wind speed.
The other variables are time-related. Each hour of each day has its own rented bike count, which means there are 24 counts per day.
Before digging into the data analysis it is essential to transform the ‘Date’ variable to the appropriate format.
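The conversion can be sketched as follows. The day/month/year layout of the raw dates and the two toy rows are assumptions made for illustration; the actual file should be checked before choosing the `format` string.

```r
# Toy rows standing in for the raw file; the "dd/mm/yyyy" layout is an assumption.
raw <- data.frame(
  Date = c("01/12/2017", "30/11/2018"),
  Rented.Bike.Count = c(254L, 584L),
  stringsAsFactors = FALSE
)

# Convert the character column to the Date class.
raw$Date <- as.Date(raw$Date, format = "%d/%m/%Y")

# One class per column, in the spirit of a variable-type summary.
classes <- sapply(raw, class)
print(classes)
```

If the format string does not match the file, `as.Date()` returns `NA` values, so a quick `anyNA(raw$Date)` check after the conversion is a cheap safeguard.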
Here is a summary of the types of variables included in the dataframe. There are both qualitative and quantitative variables.
| variable | class |
|---|---|
| Date | Date |
| Rented.Bike.Count | integer |
| Hour | integer |
| Temperature..C. | numeric |
| Humidity... | integer |
| Wind.speed..m.s. | numeric |
| Visibility..10m. | integer |
| Dew.point.temperature..C. | numeric |
| Solar.Radiation..MJ.m2. | numeric |
| Rainfall.mm. | numeric |
| Snowfall..cm. | numeric |
| Seasons | character |
| Holiday | character |
| Functioning.Day | character |
As the dataset contains a very large number of observations, we do not need to check the normality assumption of the samples before carrying out the statistical tests: by the central limit theorem, the sample means are approximately normally distributed.
To avoid repeating code, we wrote a reusable t-test function whose alternative hypothesis is ‘greater’.
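A minimal sketch of what such a helper might look like is given below. The name `greater_t_test`, its arguments and its output layout are hypothetical; the report does not show the original function. Note that `t.test()` runs Welch's unequal-variance test by default, which may or may not be the variant used here.

```r
# Hypothetical helper: one-sided two-sample t-test, H1: mean(x) > mean(y).
# t.test() defaults to Welch's test (var.equal = FALSE).
greater_t_test <- function(x, y, level = 0.95) {
  res <- t.test(x, y, alternative = "greater", conf.level = level)
  data.frame(
    mean_x  = unname(res$estimate[1]),
    mean_y  = unname(res$estimate[2]),
    p_value = res$p.value
  )
}

# Toy usage with simulated numbers (not taken from the dataset).
set.seed(1)
x <- rnorm(200, mean = 700, sd = 300)
y <- rnorm(200, mean = 250, sd = 150)
out <- greater_t_test(x, y)
print(out)
```

Returning a one-row data frame makes it easy to print the two group means and the p-value side by side, as in the result tables of this report.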
Going into the data, we found that no bikes were rented on non-functioning days. We think this may be due to Seoul’s cultural calendar: we assume the bike rental service is closed on these days, as are banks, post offices and the like.
To be more accurate, we decided to delete the rows related to non-functioning days.
| | Total Rented Bike Count |
|---|---|
| Functioning day | 6,172,314 |
| Non-functioning day | 0 |
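The check and the filtering step can be sketched as below. The four toy rows and the `"Yes"`/`"No"` coding of `Functioning.Day` are assumptions; only the column names come from the variable summary.

```r
# Toy data; the "Yes"/"No" coding of Functioning.Day is an assumption.
bikes <- data.frame(
  Rented.Bike.Count = c(254L, 0L, 584L, 0L),
  Functioning.Day   = c("Yes", "No", "Yes", "No"),
  stringsAsFactors  = FALSE
)

# Total rentals for each level of Functioning.Day.
totals <- tapply(bikes$Rented.Bike.Count, bikes$Functioning.Day, sum)
print(totals)

# Keep only the hours during which the service was running.
bikes_fun <- bikes[bikes$Functioning.Day == "Yes", ]
```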
The previous correlation matrix indicates a strong correlation between ‘Dew.point.temperature..C.’ and ‘Temperature..C.’, which means the two variables are highly collinear.
Meanwhile, 5 correlations are not statistically significant; they correspond to the zeros in the matrix.
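One way to build such a matrix, with non-significant coefficients replaced by zeros, is sketched below on simulated data. The variable names, the simulated values and the 5% threshold are all assumptions; the report does not show how its matrix was produced.

```r
# Simulated stand-ins: 'dew' is tied to 'temp' by construction, 'noise' is unrelated.
set.seed(42)
n <- 200
temp  <- rnorm(n, mean = 12, sd = 12)
dew   <- temp - abs(rnorm(n, mean = 8, sd = 3))
noise <- rnorm(n)
toy <- data.frame(temp, dew, noise)

# Pearson correlation matrix.
r <- cor(toy)

# Matrix of p-values from pairwise correlation tests (diagonal left at 0).
p <- matrix(0, nrow = ncol(toy), ncol = ncol(toy), dimnames = dimnames(r))
for (i in seq_len(ncol(toy))) {
  for (j in seq_len(ncol(toy))) {
    if (i != j) p[i, j] <- cor.test(toy[[i]], toy[[j]])$p.value
  }
}

# Replace coefficients that are not significant at the 5% level by zero.
r_masked <- r
r_masked[p > 0.05] <- 0
print(round(r_masked, 2))
```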
Here are the variables with a positive influence on the number of rented bikes (in decreasing order):
On the other hand, these are the variables with a negative influence on the number of rented bikes (in decreasing order):
This variable counts the number of bikes rented in each hour of each day between December 2017 and November 2018.
| | Min | 25% | Median | 75% | Max | Mean | Sd |
|---|---|---|---|---|---|---|---|
| Rented.Bike.Count | 2 | 214 | 542 | 1,084 | 3,556 | 729.2 | 642.4 |
Looking at the table, we notice the wide range of the ‘Rented.Bike.Count’ variable, which means that demand for rental bikes fluctuated considerably over the whole period.
From this plot we can approximate the hourly rented bike count’s expected value using the empirical mean as estimator : \[\widehat{\mathbb{E}(X)} = \overline{X}\]
where \(X\) stands for the rented bike count variable. We find \(\overline{x} \approx 729\).
Note that this does not mean an hourly count is equally likely to fall below or above 729: the median is 542, so the distribution is right-skewed and more than half of the hourly counts are lower than the mean.
On the following plot, the rented bike count increases from March to October: the better the weather, the more people ride bikes.
Furthermore, there is a noticeable overall increase in the number of rented bikes between the first (December 2017) and last (November 2018) months. We will test whether it is significant.
There might also be seasonality in the time series, possibly caused by the weather. With data over a longer period, we might observe variations in the rented bike count that recur at regular intervals, such as a regular increase from the end of Spring to the beginning of Autumn and a regular decrease from Autumn to Spring.
To answer the previous question, we extract two samples from the ‘SeoulBike’ dataset: the first covers December 2017, the bike sharing service’s start month, and the second covers November 2018, the last month in the dataset.
The rented bike count almost tripled between the two months. We will try to explain this substantial rise.
| Date | Rented.Bike.Count |
|---|---|
| 2017-12-01 | 254 |
| 2017-12-01 | 204 |
| 2017-12-01 | 173 |
| 2018-11-01 | 584 |
| 2018-11-01 | 524 |
| 2018-11-01 | 362 |
Using the Student’s t-test function we wrote earlier, we compare the two samples’ means to check whether the number of rented bikes is higher in the second period than in the first.
We carry out the following t-test at the 5\(\%\) significance level (type I error risk).
\[\left\{ \begin{array}{ll} H_0 : & \mu_1 = \mu_2 \\ H_1 : & \mu_1 > \mu_2 \end{array} \right.\]
\(\mu_1\) stands for the expected rented bike count in the second sample (November 2018) and \(\mu_2\) for that in the first sample (December 2017).
| | mean in Nov 18 | mean in Dec 17 | p-value |
|---|---|---|---|
| Test results | 718.7 | 249.1 | 7.858e-102 |
The t-test’s p-value being practically equal to 0, we conclude that \(\mu_1\) is significantly higher than \(\mu_2\). In other words, the number of rented bikes has significantly increased since the service’s start date.
The following plots depict this positive evolution. The daily count of rented bikes is plotted for both December 2017 and November 2018.
Computing the percent change between the two daily averages, we found that the daily average rented bike count increased by about 189%; that is to say, it almost tripled over the period.
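As a rough check, the same computation can be carried out on the two hourly means reported in the test table. Using them as a stand-in for the daily averages is an assumption, since the exact inputs are not shown; the result is close to the reported figure.

\[\frac{718.7 - 249.1}{249.1} \times 100 \approx 188.5\,\%, \qquad \frac{718.7}{249.1} \approx 2.89\]

A ratio of about 2.89 is what justifies the wording "almost tripled".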
The next two parts will aim at finding relationships between the rented bike count variable and the other ones in order to explain the Seoul service’s success.
First, we aggregated the data by summing the number of bikes rented during each month.
We then separate the months into two classes on the basis of the median monthly rented bike count.
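The aggregation and the median split can be sketched as follows. The six toy rows, the `Month` and `Class` column names and the class labels are assumptions made for illustration.

```r
# Toy rows; only the column names Date and Rented.Bike.Count come from the dataset.
hourly <- data.frame(
  Date = as.Date(c("2017-12-01", "2017-12-02", "2018-01-05",
                   "2018-06-10", "2018-06-11", "2018-07-01")),
  Rented.Bike.Count = c(250, 200, 150, 900, 950, 800)
)

# Sum the counts within each calendar month.
hourly$Month <- format(hourly$Date, "%Y-%m")
monthly <- aggregate(Rented.Bike.Count ~ Month, data = hourly, FUN = sum)

# Split the months into two classes around the median monthly total.
med <- median(monthly$Rented.Bike.Count)
monthly$Class <- ifelse(monthly$Rented.Bike.Count > med,
                        "above median", "at or below median")
print(monthly)
```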
From the table, one notes a dichotomy between months. June, July, May, September, August and October stand apart, especially June, with a total of nearly 900,000 rented bikes. This may reflect more favourable weather.
Conversely, winter months such as January, February and December do poorly: the combined total of these 3 months does not even reach half a million. How can this be explained? These are cold months. Moreover, the service was only launched in December and had not yet reached maturity.
In this part we computed some statistical indicators for the daily number of rented bikes. What is striking in the following table is the increase in both the mean and the variability of the number of bikes rented each day.
| Month | Mean | sd | Error bound |
|---|---|---|---|
| Jan | 4838.90 | 1395.29 | 511.80 |
| Feb | 5422.61 | 1917.06 | 743.36 |
| Mar | 12277.23 | 4040.71 | 1482.15 |
| Apr | 18076.79 | 7563.76 | 2877.10 |
| May | 23569.60 | 8668.03 | 3236.70 |
| Jun | 29896.23 | 6226.19 | 2324.90 |
| Jul | 23692.26 | 7439.63 | 2728.88 |
| Aug | 21028.61 | 5174.10 | 1897.88 |
| Sep | 25908.15 | 6207.25 | 2507.16 |
| Oct | 23238.39 | 5867.61 | 2275.22 |
| Nov | 17248.70 | 5043.17 | 1995.01 |
| Dec | 5978.39 | 1943.16 | 712.76 |
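The "Error bound" column appears consistent with the half-width of a 95% Student confidence interval for the mean, \(t_{0.975,\,n-1}\, s / \sqrt{n}\), with \(n\) the number of days in the month. This reading is inferred from the numbers rather than stated in the text, and the helper name `error_bound` is hypothetical. The sketch below reproduces the January and February values, assuming 31 and 28 days respectively.

```r
# Hypothetical helper: half-width of a two-sided t-interval for a mean,
# given the sample standard deviation s and the sample size n.
error_bound <- function(s, n, level = 0.95) {
  qt(1 - (1 - level) / 2, df = n - 1) * s / sqrt(n)
}

# Standard deviations taken from the table; day counts are assumptions.
jan <- error_bound(1395.29, n = 31)
feb <- error_bound(1917.06, n = 28)
print(round(c(Jan = jan, Feb = feb), 2))
```

For months with fewer functioning days, \(n\) would be smaller than the calendar length, which widens the bound.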
The following chart confirms our previous observations: June is clearly above the other months, and the three winter months lag well behind the rest.
The rise in the confidence intervals’ width illustrates that the more bikes are rented, the more fluctuation appears.
Digging deeper, we focused on the hourly rented bike count. We cut the days into 4 periods:
At first sight, the number of users of the bike sharing service seems to increase from 6 a.m. to 6 p.m., then decrease until it reaches its minimum at 4 a.m.
We first compare daytime and night-time rentals. The test results show a large and significant gap between the two periods’ means, with an extremely low p-value. Therefore, the null hypothesis of equal means is rejected.
| | mean in group DayTime | mean in group NightTime | p-value |
|---|---|---|---|
| Test results | 817.6 | 624.5 | 6.174e-44 |
We made another, perhaps less obvious, comparison: the number of bikes rented in the afternoon versus in the evening. The associated test led us to conclude that there is no statistically significant difference between the two means.
| | Afternoon | Evening | p-value |
|---|---|---|---|
| Test results | 1,016 | 1,011 | 0.424 |
At this point, one question arises: how could the Seoul bike sharing service optimize its supply of bikes during the day? Given our observations and test results, it might be wise to prioritize the service from 7 a.m. to 10 p.m.
Is the number of rented bikes influenced by holidays?
To answer this question, we drew a boxplot of the rented bike count for each level of the two-level variable ‘Holiday’.
The median on holidays is about half that on other days. Additionally, each “No Holidays” quartile is much higher than its “Holidays” counterpart. This probably reflects a negative impact of holidays.
Besides, rented bike counts are less spread out on holidays, whereas the highest values are far larger on “No Holidays”.
To confirm this, we tested whether the impact of holidays on bike rental is statistically significant, using another test of means.
| | mean in group No Holiday | mean in group Holiday | p-value |
|---|---|---|---|
| Test results | 739.3 | 529.2 | 1.501e-12 |
The results are clear: holidays have a significant negative effect on the number of rented bikes.
Now that we have found patterns between the time variables and the rented bike count variable, it seems relevant to focus on the other part of the dataset. It’s time to use the weather-related variables to explain the evolution of the Seoul bike sharing service.
| | Min | 25% | Median | 75% | Max | Mean | Sd |
|---|---|---|---|---|---|---|---|
| Temperature..C. | -17.8 | 3 | 13.5 | 22.7 | 39.4 | 12.77 | 12.1 |
| Visibility..10m. | 27 | 935 | 1,690 | 2,000 | 2,000 | 1,434 | 609.1 |
| Solar.Radiation..MJ.m2. | 0 | 0 | 0.01 | 0.93 | 3.52 | 0.5679 | 0.8682 |
| Wind.speed..km.h. | 0 | 3.24 | 5.4 | 8.28 | 26.64 | 6.213 | 3.723 |
As shown in both the density plot and the previous table, the temperature in Seoul fluctuates considerably. From the plot we can divide the temperature distribution into two distinct groups: cold and warm temperatures.
These two features are the consequence of the city’s continental climate.
The following stacked density graph highlights the point made above. The Winter density is almost the mirror image of the Summer density. The Spring and Autumn densities can be viewed as transitions between the two opposite seasons.
Seoul is not known to be a sunny place. Moreover, the solar radiation level fluctuates little.
On average the wind speed in Seoul is about 6 km/h, which is much lower than the worldwide average (~35 km/h). The low variability between seasons indicates that Seoul has very little wind throughout the year. This may be a good point for bike rental.
Winter has the lowest median, and the number of rented bikes is less spread out than in the other seasons. The gap is large enough that we did not test whether the number of rented bikes is lower during Winter.
However, the boxplots for the three other seasons led us to conduct a one-way ANOVA to compare their means.
The test’s hypotheses are defined as follows:
\[\left\{ \begin{array}{ll} H_0 : & \mu_i = \mu\ ;\ \forall i =1,2,3 \\ H_1 : & \exists \ i \neq j \ | \ \mu_i \neq \mu_j \end{array} \right.\]
where \(\mu_i\) represents the rented bike count expected value for the season \(i\).
| Test results | |
|---|---|
| Df | 2.00 |
| Test statistic | 110.63 |
| p-value | 0.00 |
As the p-value is less than the 0.05 significance level, we can conclude there are significant differences in terms of rented bikes among the seasons.
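A one-way ANOVA of this kind can be sketched with `aov()`. The data below are simulated and the group means are arbitrary; only the three-season layout mirrors the test described here, so the resulting statistic is not expected to match the table.

```r
# Simulated counts for three seasons (values are made up for illustration).
set.seed(7)
toy <- data.frame(
  Seasons = rep(c("Spring", "Summer", "Autumn"), each = 50),
  Count   = c(rnorm(50, mean = 700,  sd = 200),
              rnorm(50, mean = 1000, sd = 200),
              rnorm(50, mean = 800,  sd = 200))
)

# One-way ANOVA: does the mean count differ across seasons?
fit <- aov(Count ~ Seasons, data = toy)
tab <- summary(fit)[[1]]

df_between <- tab[["Df"]][1]       # number of groups minus one
f_stat     <- tab[["F value"]][1]  # test statistic
p_val      <- tab[["Pr(>F)"]][1]   # p-value
print(tab)
```

With three groups the between-group degrees of freedom equal 2, which matches the `Df` reported in the results table.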
There is an inverted U-shaped relationship between the daily rented bike count and the daily average temperature. It implies there is an optimal temperature level which maximizes the number of rented bikes.
Using the two dashed vertical lines representing the median and third quartile of the daily average temperatures, we created a categorical variable that distinguishes the temperature levels:
Then we conducted a t-test to compare means between the levels ‘Medium’ and ‘High’.
| | mean in group High | mean in group Medium | p-value |
|---|---|---|---|
| Test results | 25,010 | 23,530 | 0.08812 |
As the p-value is higher than the 0.05 significance level, we cannot reject the null hypothesis: there is no significant difference in the daily number of rented bikes between the ‘Medium’ and ‘High’ temperature levels.
In other words, the points on either side of the temperature that maximizes the number of rented bikes follow a similar pattern.
Nevertheless, the p-value is fairly low: at any significance level \(\alpha\) above the p-value of about 0.088 (for example 10%), we would reject the null hypothesis of equal means. In other words, we could no longer retain the hypothesis of equal means if we lowered the test’s confidence level.
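The split of daily average temperatures at the median and third quartile can be sketched with `cut()`. The temperatures below are made up, and the ‘Low’ label is an assumption: the text only names the ‘Medium’ and ‘High’ levels.

```r
# Made-up daily average temperatures in °C.
temp_avg <- c(-5, 2, 8, 13, 15, 19, 22, 24, 27, 30)

# Cut points: median and third quartile of the daily averages.
q <- unname(quantile(temp_avg, probs = c(0.5, 0.75)))

# Three ordered levels; "Low" is a hypothetical label for values up to the median.
temp_level <- cut(temp_avg,
                  breaks = c(-Inf, q[1], q[2], Inf),
                  labels = c("Low", "Medium", "High"))
print(table(temp_level))
```

The two upper levels can then be compared with a two-sample t-test by subsetting the daily counts on `temp_level`.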
There is an increasing linear relationship between the number of bikes rented per day and the average level of solar radiation.
Fitting a linear regression on these two variables, we find that an increase of 0.1 MJ/m² in the level of solar radiation is associated with an increase of about 2,300 rented bikes per day. This result must be interpreted with caution, since the average level of solar radiation is close to 0 and the variable fluctuates little.
| Term | Estimate | Std. error | T-statistic | p-value |
|---|---|---|---|---|
| (Intercept) | 4352.39 | 737.977 | 5.898 | < 0.001 |
| Sol_Rad_avg | 23132.254 | 1136.06 | 20.362 | < 0.001 |
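A regression of this form can be sketched with `lm()`. The data are simulated around coefficients of a similar magnitude, so the fitted values only illustrate how the slope is read; the variable names `sol` and `count` are assumptions.

```r
# Simulated daily data: average solar radiation (MJ/m2) and daily rentals.
set.seed(3)
sol   <- runif(100, min = 0, max = 1.2)
count <- 4000 + 23000 * sol + rnorm(100, sd = 3000)

# Simple linear regression of the daily count on average solar radiation.
fit   <- lm(count ~ sol)
slope <- coef(fit)[["sol"]]

# Change in daily rentals associated with +0.1 MJ/m2 of solar radiation.
effect_0_1 <- 0.1 * slope
print(summary(fit)$coefficients)
```

Applied to the estimate in the table, the same reading gives \(0.1 \times 23132.254 \approx 2313\) additional rented bikes per day.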
Although R plots an inverted U-shaped relationship between the daily rented bike count and the average wind speed, the point cloud is widely scattered.
A decreasing relationship appears once the wind speed becomes noticeable.
We conducted a t-test to check whether the number of rented bikes is higher when the average wind speed is low, i.e. lower than the median of the average wind speed.
| | mean in group Low | mean in group High | p-value |
|---|---|---|---|
| Test results | 18,490 | 16,490 | 0.02952 |
As the test’s p-value is lower than the significance level \(\alpha\) = 5%, we reject the null hypothesis in favour of the ‘greater’ alternative. In other words, people rent more bikes when there is little wind.
Coming back to the questions we asked at the beginning of our analysis, the Seoul bike sharing service appears to be a success, as shown by the t-test on the two samples ‘Dec17’ and ‘Nov18’.
We also asked how the other variables influence the number of rented bikes.
Having split our case study into two parts, we found that the rented bike count depends on both time-related and meteorological variables.
As regards the temporal variables, the heavier use of the service during daytime and outside holidays suggests that the Seoul bike sharing service is work-oriented.
There are also more rented bikes during sunny months, especially in June, to such a point that the Summer season stands out from the others. Indeed the part on meteorological variables emphasizes an increasing relationship between the number of rented bikes and both the temperatures and the solar radiation level.
The next part will aim at grouping similar variables in order to avoid overfitting in the models we will estimate. We will also classify days based on their features. To this end we shall apply PCA and CA techniques to the ‘Seoul Bike’ data.